kClust: Fast and sensitive clustering of large protein sequence databases.

Abstract

Background: Fueled by rapid progress in high-throughput sequencing, the size of public sequence databases doubles every two years. Searching the ever larger and more redundant databases is getting increasingly inefficient. Clustering can help to organize sequences into homologous and functionally similar groups and can improve the speed, sensitivity, and readability of homology searches. However, because the clustering time is quadratic in the number of sequences, standard sequence search methods are becoming impracticable.

Results: Here we present a method to cluster large protein sequence databases such as UniProt within days down to 20%–30% maximum pairwise sequence identity. kClust owes its speed and sensitivity to an alignment-free prefilter that calculates the cumulative score of all similar 6-mers between pairs of sequences, and to a dynamic programming algorithm that operates on pairs of similar 4-mers. To increase sensitivity further, kClust can run in profile-sequence comparison mode, with profiles computed from the clusters of a previous kClust iteration. kClust is two to three orders of magnitude faster than clustering based on NCBI BLAST, and on multidomain sequences of 20%–30% maximum pairwise sequence identity it achieves comparable sensitivity and a lower false discovery rate. It also compares favorably to CD-HIT and UCLUST in terms of false discovery rate, sensitivity, and speed.

Conclusions: kClust fills the need for a fast, sensitive, and accurate tool to cluster large protein sequence databases to below 30% sequence identity. kClust is freely available under GPL at http://toolkit.lmb.uni-muenchen.de/pub/kClust/.
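The idea of an alignment-free k-mer prefilter can be illustrated with a minimal Python sketch. It is not kClust's implementation and makes a deliberate simplification: it counts exactly matching k-mers, whereas the abstract describes a cumulative score over all similar 6-mers. The function names (`kmers`, `build_index`, `prefilter`) and the `min_shared` threshold are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(db, k):
    """Map each k-mer to the ids of the database sequences containing it."""
    index = defaultdict(set)
    for seq_id, seq in db.items():
        for kmer in kmers(seq, k):
            index[kmer].add(seq_id)
    return index


def prefilter(query, index, k, min_shared):
    """Count exact k-mers shared with the query for each database sequence.

    Simplification (assumption): exact matches only, with a plain count
    standing in for the similarity-based cumulative score in the abstract.
    Only sequences with at least min_shared shared k-mers are returned.
    """
    counts = defaultdict(int)
    for kmer in kmers(query, k):
        for seq_id in index.get(kmer, ()):
            counts[seq_id] += 1
    return {seq_id: n for seq_id, n in counts.items() if n >= min_shared}
```

Because the index is looked up once per query k-mer, only sequences that share at least one k-mer with the query are ever touched. This is what lets a prefilter of this kind avoid comparing every pair of sequences in full; the surviving candidates would then go to a more expensive alignment step.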


Bibliographic details

Authors: Hauser, M.; Mayer, C.; Söding, J.
Year: 2013
Original format: PDF
Language: English
Similar documents

1. kClust: fast and sensitive clustering of large protein sequence databases [J]. Maria Hauser, Christian E Mayer, Johannes Söding. BMC Bioinformatics, 2013, issue 1.
2. Clustering of highly homologous sequences to reduce the size of large protein databases [J]. Li W, Jaroszewski L, Godzik A. Bioinformatics, 2001, issue 3.
3. Markov model recognition and classification of DNA/protein sequences within large text databases [J]. Wren JD, Hildebrand WH, Chandrasekaran S. Bioinformatics, 2005, issue 21.
4. Species delineation by ribosomal protein sequences, and identification of bacteria by MALDI to DNA databases [C]. Kenneth C. Parker. ASMS Conference on Mass Spectrometry and Allied Topics, 2016.
5. FAC-PIN: An efficient and fast agglomerative clustering algorithm for protein interaction networks to predict protein complexes and functional modules [D]. Rahman, Mohammad Shamsur. 2013.
6. kClust: fast and sensitive clustering of large protein sequence databases [O]. Maria Hauser, Christian E Mayer, Johannes Söding. 2013.
